{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 101 - Training and Evaluating Classifiers with `mmlspark`\n",
    "\n",
    "In this example, we try to predict incomes from the *Adult Census* dataset.\n",
    "\n",
    "First, we import the packages (use `help(mmlspark)` to view contents),"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's read the data and split it to train and test sets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataFilePath = \"AdultCensusIncome.csv\"\n",
    "import os, urllib\n",
    "if not os.path.isfile(dataFilePath):\n",
    "    urllib.request.urlretrieve(\"https://mmlspark.azureedge.net/datasets/\" + dataFilePath, dataFilePath)\n",
    "data = spark.createDataFrame(pd.read_csv(dataFilePath, dtype={\" hours-per-week\": np.float64}))\n",
    "data = data.select([\" education\", \" marital-status\", \" hours-per-week\", \" income\"])\n",
    "train, test = data.randomSplit([0.75, 0.25], seed=123)\n",
    "train.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`TrainClassifier` can be used to initialize and fit a model, it wraps SparkML classifiers.\n",
    "You can use `help(mmlspark.TrainClassifier)` to view the different parameters.\n",
    "\n",
    "Note that it implicitly converts the data into the format expected by the algorithm: tokenize\n",
    "and hash strings, one-hot encodes categorical variables, assembles the features into a vector\n",
    "and so on.  The parameter `numFeatures` controls the number of hashed features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark.train import TrainClassifier\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "model = TrainClassifier(model=LogisticRegression(), labelCol=\" income\", numFeatures=256).fit(train)\n",
    "model.write().overwrite().save(\"adultCensusIncomeModel.mml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the model is trained, we score it against the test dataset and view metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark.train import ComputeModelStatistics, TrainedClassifierModel\n",
    "predictionModel = TrainedClassifierModel.load(\"adultCensusIncomeModel.mml\")\n",
    "prediction = predictionModel.transform(test)\n",
    "metrics = ComputeModelStatistics().transform(prediction)\n",
    "metrics.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we save the model so it can be used in a scoring program."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.write().overwrite().save(\"AdultCensus.mml\")"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
